Visualizing student schedules

Case study in solving problems with Polars and Altair

Eric Monson, PhD

Duke University Libraries
Center for Data and Visualization Sciences

Advisor Student Heatmaps

Faculty need a quick way of summarizing busy and free times for the group of students they advise

  • This ends up being a nice case study for using basic Polars
    + visualization with Altair (the default plot module)
  • Along the way I’ll review syntax and contrast with Pandas

Where are we headed?

Students’ schedules

Students’ advisors

All schedules/advisor

Typical import of modules

import polars as pl
import altair as alt

Preview student schedules

pl.read_csv('./data/students_deidentified.csv', n_rows=5)
shape: (5, 7)
StudentID StudentName Descr Subject Pat Mtg Start Mtg End
str str str str str str str
"062264a" "Joseph Garner" "INTRODUCTORY MECHANICS" "PHYSICS" "WF" "1:25:00 PM" "2:40:00 PM"
"062264a" "Joseph Garner" "ACADEMIC WRITING" "WRITING" "WF" "3:05:00 PM" "4:20:00 PM"
"062264a" "Joseph Garner" "MATRICES AND VECTORS" "MATH" "TTH" "10:05:00 AM" "11:20:00 AM"
"062264a" "Joseph Garner" "INTRO TO SIGNALS AND SYSTEMS" "ECE" "F" "10:05:00 AM" "12:55:00 PM"
"062264a" "Joseph Garner" "ADV TOPICS IN DEEP LEARNING" "ECE" "MW" "11:45:00 AM" "1:00:00 PM"
  • Polars only supports UTF-8 text encoding for fast reads
  • Not doing a lazy read/DataFrame here since the files aren’t big
  • “Mtg Start” and “Mtg End” look like times but are strings right now
  • If the IDs were integers we’d force them to be read as a strings
    to not lose any leading zeros

Full CSV read w/time conversion

students_df = (pl.read_csv('./data/students_deidentified.csv')
                .with_columns(start=pl.col('Mtg Start').str.to_time('%r'),
                                end=pl.col('Mtg End').str.to_time('%r'))
                .drop(pl.col('Mtg Start','Mtg End'))
           )

print('Full DataFrame shape:', students_df.shape)
students_df.head(10)
Full DataFrame shape: (2976, 7)
shape: (10, 7)
StudentID StudentName Descr Subject Pat start end
str str str str str time time
"062264a" "Joseph Garner" "INTRODUCTORY MECHANICS" "PHYSICS" "WF" 13:25:00 14:40:00
"062264a" "Joseph Garner" "ACADEMIC WRITING" "WRITING" "WF" 15:05:00 16:20:00
"062264a" "Joseph Garner" "MATRICES AND VECTORS" "MATH" "TTH" 10:05:00 11:20:00
"062264a" "Joseph Garner" "INTRO TO SIGNALS AND SYSTEMS" "ECE" "F" 10:05:00 12:55:00
"062264a" "Joseph Garner" "ADV TOPICS IN DEEP LEARNING" "ECE" "MW" 11:45:00 13:00:00
"062264a" "Joseph Garner" "INTRODUCTORY MECHANICS" "PHYSICS" "M" 13:30:00 15:30:00
"062264a" "Joseph Garner" "INTRODUCTORY MECHANICS" "PHYSICS" "T" 13:30:00 15:30:00
"062264a" "Joseph Garner" "HOUSE COURSE (SP TOP)" "HOUSECS" "M" 17:15:00 18:45:00
"74b05b7" "Vanessa Thompson" "INTERMEDIATE MECHANICS" "PHYSICS" "WF" 10:05:00 11:20:00
"74b05b7" "Vanessa Thompson" "INTRO EXPERIMENTAL PHYSICS I" "PHYSICS" "T" 15:45:00 17:45:00

Method chaining

(pl.read_csv('./data/students_deidentified.csv')
    .with_columns(start=pl.col('Mtg Start').str.to_time('%r'),
                  end=pl.col('Mtg End').str.to_time('%r'))
    .drop(pl.col('Mtg Start','Mtg End'))
)
  • Just chaining further processing onto read_csv()
  • (…code…) alows breaks at dots

Adding new columns

(pl.read_csv('./data/students_deidentified.csv')
    .with_columns(start=pl.col('Mtg Start').str.to_time('%r'),
                  end=pl.col('Mtg End').str.to_time('%r'))
    .drop(pl.col('Mtg Start','Mtg End'))
)
  • .with_columns() – adds new or modifies existing columns

Adding new columns

(pl.read_csv('./data/students_deidentified.csv')
    .with_columns(pl.col('Mtg Start').str.to_time('%r').alias('start'),
                  end=pl.col('Mtg End').str.to_time('%r'))
    .drop(pl.col('Mtg Start','Mtg End'))
)
  • .with_columns() – adds new or modifies existing columns

Naming your new columns

(pl.read_csv('./data/students_deidentified.csv')
    .with_columns(pl.col('Mtg Start').str.to_time('%r').alias('start'),
                  end=pl.col('Mtg End').str.to_time('%r'))
    .drop(pl.col('Mtg Start','Mtg End'))
)

Two format options:

expression.alias('name')
or
name=expression

  • Second preferred for readability
  • Note: Python doesn’t allow “positional” arguments after “keyword” arguments, so can’t use .alias() after name=

Converting strings to times

(pl.read_csv('./data/students_deidentified.csv')
    .with_columns(start=pl.col('Mtg Start').str.to_time('%r'),
                  end=pl.col('Mtg End').str.to_time('%r'))
    .drop(pl.col('Mtg Start','Mtg End'))
)

Dropping unnecessary columns

(pl.read_csv('./data/students_deidentified.csv')
    .with_columns(start=pl.col('Mtg Start').str.to_time('%r'),
                  end=pl.col('Mtg End').str.to_time('%r'))
    .drop(pl.col('Mtg Start','Mtg End'))
)
  • .drop() – only drops columns, unlike Pandas

Existing column names

(pl.read_csv('./data/students_deidentified.csv')
    .with_columns(start=pl.col('Mtg Start').str.to_time('%r'),
                  end=pl.col('Mtg End').str.to_time('%r'))
    .drop(pl.col('Mtg Start','Mtg End'))
)

Two format options:

pl.col('name')Polars expression
or
'name'plain string shortcut

  • First more flexible – the basis for building complex expressions
  • First necessary for lazy evaluation, optimization, and parallel execution

Read advisors DataFrame

advisors_df = pl.read_csv('./data/advisors_deidentified.csv')
advisors_df
shape: (352, 4)
StudentID StudentName AdvisorName AdvisorID
str str str str
"f1010e9" "Misty Lee" "Sheri Mosley" "28db778"
"7d8a6c1" "Rebecca Thomas" "Sheri Mosley" "28db778"
"3d66893" "Karen Clark" "Sheri Mosley" "28db778"
"f151ba8" "Maxwell Kirby" "Sheri Mosley" "28db778"
"d48c1d6" "Maria Robertson" "Sheri Mosley" "28db778"
"fde1944" "Gina Wolfe" "Philip Gallagher" "5cb893c"
"1315a19" "Dalton Wu" "Philip Gallagher" "5cb893c"
"52fbc5f" "Heidi Stafford" "Philip Gallagher" "5cb893c"
"13704ca" "Neil Caldwell" "Philip Gallagher" "5cb893c"
"dc434aa" "Brooke Whitney" "Philip Gallagher" "5cb893c"

JOIN students and advisors

  • JOIN on StudentID
  • StudentName is in both DataFrames so can drop one of them
  • Filling null AdvisorName with placeholder string
students_advisors_df = (
    students_df.join(advisors_df, on='StudentID', how='left')
          .drop(pl.col('StudentName_right'))
          .with_columns(pl.col('AdvisorName').fill_null(pl.lit('No Advisor')))
         )
students_advisors_df.sample(10)
shape: (10, 9)
StudentID StudentName Descr Subject Pat start end AdvisorName AdvisorID
str str str str str time time str str
"593358a" "Alyssa Cross" "ADVANCED INTRO PROBABILITY" "MATH" "MWF" 10:20:00 11:10:00 "Maurice Maldonado" "5af7c52"
"da572ef" "Allen Villegas" "ADVANCED INTRO PROBABILITY" "MATH" "TH" 16:40:00 17:55:00 "Phillip Hardy" "0e670a6"
"83668c4" "Krystal Stokes" "INTERMEDIATE MECHANICS" "PHYSICS" "WF" 10:05:00 11:20:00 "Miranda Curtis" "5eefe5c"
"74d51f9" "Elaine Drake" "INTERMEDIATE MECHANICS" "PHYSICS" "W" 18:00:00 20:00:00 "Beth Mcconnell" "213c159"
"f80be5b" "Catherine Lucas" "INTERMEDIATE MECHANICS" "PHYSICS" "T" 18:00:00 20:00:00 "Angie Moon" "e7e131b"
"65bd890" "Barbara Wagner" "INTRO ELECTRIC, MAGNET, OPTICS" "PHYSICS" "F" 12:00:00 14:30:00 "Henry Schroeder" "e6d6592"
"e34a50c" "Jasmine Harmon" "INTRO EXPERIMENTAL PHYSICS I" "PHYSICS" "M" 13:30:00 15:30:00 "Chad Reese" "26c9a35"
"711b837" "Carol Garrett" "RESEARCH PROJECTS IN EGR" "EGR" null null null "Lindsey Burnett" "c0d3c80"
"8fdc182" "Melanie Ortega" "RESEARCH PROJECTS IN EGR" "EGR" null null null "Chelsea Fernandez" "413a839"
"8ccbeeb" "Casey French" "QUANT PHYSIOLOGY BIOSTAT APPL" "BME" "TTH" 08:30:00 09:45:00 "Luke Pace" "4f3f9d1"

Exporatory aside

Some students have no advisor

(students_advisors_df
 .filter(pl.col('AdvisorName') == "No Advisor")
 .select(pl.col('StudentID','StudentName'))
 .unique()
)
shape: (22, 2)
StudentID StudentName
str str
"2444c0d" "Nicolas Holland"
"95f9885" "Austin Fischer"
"5fe5193" "Daniel Miller"
"67c5589" "Julie Keller"
"9ff23f4" "Randy Williams"
"de5e9c9" "Jenny Reyes"
"dce891d" "Leonard Armstrong"
"630cfdd" "Beverly Wright"
"99c59cc" "Bruce Pratt"
"9d04ed2" "Caitlin Adams"

Filter returns subset of rows

(students_advisors_df
 .filter(pl.col('AdvisorName') == "No Advisor")
 .select(pl.col('StudentID','StudentName'))
 .unique()
)

.filter()

Select returns subset of columns

(students_advisors_df
 .filter(pl.col('AdvisorName') == "No Advisor")
 .select(pl.col('StudentID','StudentName'))
 .unique()
)

.filter()

.select()

Unique removes repeated rows

(students_advisors_df
 .filter(pl.col('AdvisorName') == "No Advisor")
 .select(pl.col('StudentID','StudentName'))
 .unique()
)
shape: (22, 2)
StudentID StudentName
str str
"630cfdd" "Beverly Wright"
"c299e53" "Crystal Becker"
"2444c0d" "Nicolas Holland"
"95f9885" "Austin Fischer"
"a763934" "Jim Fisher"
"67c5589" "Julie Keller"
"dce891d" "Leonard Armstrong"
"1ef6826" "Brendan Dodson"
"de5e9c9" "Jenny Reyes"
"6a2617f" "Leslie Willis"

Some classes have no start time

(students_advisors_df
 .filter(pl.col('start').is_null())
 .group_by(pl.col('Subject','Descr'))
 .agg(count = pl.col('StudentID').count())
 .sort('count', descending=True)
)
shape: (13, 3)
Subject Descr count
str str u32
"EGR" "RESEARCH PROJECTS IN EGR" 52
"CEE" "ENGINEERING THE PLANET" 10
"MUSIC" "SYMPHONY ORCHESTRA" 2
"MUSIC" "INTRO GUITAR CLASS" 2
"MUSIC" "WIND SYMPHONY" 1
"MUSIC" "FLUTE" 1
"ISS" "INFORMATION, SOCIETY & CULTURE" 1
"EGR" "DESIGN TO DELIVER" 1
"HLTHPOL" "BASS CONNECTION HEALTH POLICY" 1
"MUSIC" "CLARINET" 1

Filter to null start time

(students_advisors_df
 .filter(pl.col('start').is_null())
 .group_by(pl.col('Subject','Descr'))
 .agg(count = pl.col('StudentID').count())
 .sort('count', descending=True)
)
  • Remember, .filter() lets through what is True

Group by Subject & Description

(students_advisors_df
 .filter(pl.col('start').is_null())
 .group_by(pl.col('Subject','Descr'))
 .agg(count = pl.col('StudentID').count())
 .sort('count', descending=True)
)
  • Get a separate group for each unique combination of the two variables
  • Unlike Pandas
    • .group_by() returns grouping variables as normal columns
    • There is no Index in Polars DataFrames

Aggregate by count of ID

(students_advisors_df
 .filter(pl.col('start').is_null())
 .group_by(pl.col('Subject','Descr'))
 .agg(count = pl.col('StudentID').count())
 .sort('count', descending=True)
)
  • Perform this aggregation on each group
  • Could put more expressions in .agg() separated by commas

Sort by count, descending

(students_advisors_df
 .filter(pl.col('start').is_null())
 .group_by(pl.col('Subject','Descr'))
 .agg(count = pl.col('StudentID').count())
 .sort('count', descending=True)
)
  • Ascending sort order is the default
  • Note: Pandas is opposite for descending: ascending=False

Some classes have no start time

(students_advisors_df
 .filter(pl.col('start').is_null())
 .group_by(pl.col('Subject','Descr'))
 .agg(count = pl.col('StudentID').count())
 .sort('count', descending=True)
)
shape: (13, 3)
Subject Descr count
str str u32
"EGR" "RESEARCH PROJECTS IN EGR" 52
"CEE" "ENGINEERING THE PLANET" 10
"MUSIC" "INTRO GUITAR CLASS" 2
"MUSIC" "SYMPHONY ORCHESTRA" 2
"MUSIC" "THEOR/PRAC TONAL MUS I" 1
"HLTHPOL" "BASS CONNECTION HEALTH POLICY" 1
"EGR" "DESIGN TO DELIVER" 1
"MUSIC" "MEET THE BEATLES AND THE 1960S" 1
"MUSIC" "WIND SYMPHONY" 1
"ISS" "INFORMATION, SOCIETY & CULTURE" 1

Back to data prep

Dealing with days of the week

StudentID StudentName Descr Subject Pat start end AdvisorName AdvisorID
062264a Joseph Garner INTRODUCTORY MECHANICS PHYSICS WF 13:25:00 14:40:00 Henry Schroeder e6d6592
062264a Joseph Garner ACADEMIC WRITING WRITING WF 15:05:00 16:20:00 Henry Schroeder e6d6592
062264a Joseph Garner MATRICES AND VECTORS MATH TTH 10:05:00 11:20:00 Henry Schroeder e6d6592
062264a Joseph Garner INTRO TO SIGNALS AND SYSTEMS ECE F 10:05:00 12:55:00 Henry Schroeder e6d6592
062264a Joseph Garner ADV TOPICS IN DEEP LEARNING ECE MW 11:45:00 13:00:00 Henry Schroeder e6d6592
062264a Joseph Garner INTRODUCTORY MECHANICS PHYSICS M 13:30:00 15:30:00 Henry Schroeder e6d6592
062264a Joseph Garner INTRODUCTORY MECHANICS PHYSICS T 13:30:00 15:30:00 Henry Schroeder e6d6592
062264a Joseph Garner HOUSE COURSE (SP TOP) HOUSECS M 17:15:00 18:45:00 Henry Schroeder e6d6592
74b05b7 Vanessa Thompson INTERMEDIATE MECHANICS PHYSICS WF 10:05:00 11:20:00 Dan Sims eced0a0
74b05b7 Vanessa Thompson INTRO EXPERIMENTAL PHYSICS I PHYSICS T 15:45:00 17:45:00 Dan Sims eced0a0
  • Need to spell out days of the week for the visualization
  • One day per row in the DataFrame for aggregation

Aside: styling with great_tables

from great_tables import loc, style

(students_advisors_df.head(6).style
    .tab_style(
        style=[style.text(weight='bold'), style.fill("#FCF7E5")],
        locations=loc.body(columns="Pat")
    )
)
StudentID StudentName Descr Subject Pat start end AdvisorName AdvisorID
062264a Joseph Garner INTRODUCTORY MECHANICS PHYSICS WF 13:25:00 14:40:00 Henry Schroeder e6d6592
062264a Joseph Garner ACADEMIC WRITING WRITING WF 15:05:00 16:20:00 Henry Schroeder e6d6592
062264a Joseph Garner MATRICES AND VECTORS MATH TTH 10:05:00 11:20:00 Henry Schroeder e6d6592
062264a Joseph Garner INTRO TO SIGNALS AND SYSTEMS ECE F 10:05:00 12:55:00 Henry Schroeder e6d6592
062264a Joseph Garner ADV TOPICS IN DEEP LEARNING ECE MW 11:45:00 13:00:00 Henry Schroeder e6d6592
062264a Joseph Garner INTRODUCTORY MECHANICS PHYSICS M 13:30:00 15:30:00 Henry Schroeder e6d6592

Explicitly define day lookup patterns

  • Decide whether you want the full or abbreviated forms of the weekdays in the vis
days_version = "Abbreviation"
# days_version = "Full"

classdays_abbrev_dict = {'TTH':['Tues','Thurs'], 
                  'MW':['Mon','Wed'], 
                  'MTW':['Mon','Tues','Wed'],
                  'MTTH':['Mon','Tues','Thurs'],
                  'M':['Mon'], 
                  'WF':['Wed','Fri'], 
                  'F':['Fri'], 
                  'TH':['Thurs'], 
                  'MWF':['Mon','Wed','Fri'], 
                  'MTH':['Mon','Thurs'], 
                  'MTWF':['Mon','Tues','Wed','Fri'], 
                  'MF':['Mon','Fri'], 
                  'M-TH':['Mon','Tues','Wed','Thurs'],
                  'M-F':['Mon','Tues','Wed','Thurs','Fri'],
                  'T':['Tues'], 
                  'MT':['Mon','Tues'], 
                  'W':['Wed'],
                  'TF':['Tues','Fri']}

classdays_full_dict = {'TTH':['Tuesday','Thursday'], 
                  'MW':['Monday','Wednesday'], 
                  'MTW':['Monday','Tuesday','Wednesday'],
                  'MTTH':['Monday','Tuesday','Thursday'],
                  'M':['Monday'], 
                  'WF':['Wednesday','Friday'], 
                  'F':['Friday'], 
                  'TH':['Thursday'], 
                  'MWF':['Monday','Wednesday','Friday'], 
                  'MTH':['Monday','Thursday'], 
                  'MTWF':['Monday','Tuesday','Wednesday','Friday'], 
                  'MF':['Monday','Friday'], 
                  'M-TH':['Monday','Tuesday','Wednesday','Thursday'],
                  'M-F':['Monday','Tuesday','Wednesday','Thursday','Friday'],
                  'T':['Tuesday'], 
                  'MT':['Monday','Tuesday'], 
                  'W':['Wednesday'],
                  'TF':['Tuesday','Friday']}

if days_version.lower() == "abbreviation":
    classdays_dict = classdays_abbrev_dict
else:
    classdays_dict = classdays_full_dict

Parsing for day pattern lookups

  • This version adapts to the patterns present (better for new data but not more readable)
# days_version = "full"
days_version = "abbreviation"

days_letters = ['M', 'T', 'W', 'H', 'F']
days_list_full = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday']
days_list_abbrev = ['Mon', 'Tues', 'Wed', 'Thurs', 'Fri']

# Using Pandas Series so can use slice notation on day ranges with dashes
if days_version.lower() == "full":
    classdays_series = pd.Series(dict(zip(days_letters, days_list_full)))
else:
    classdays_series = pd.Series(dict(zip(days_letters, days_list_abbrev)))

classdays_dict = {}
for day_pattern in students_advisors_df.get_column('Pat').unique().to_list():
    if day_pattern is not None:                             # There are some null day patterns
        day_pattern_noTH = day_pattern.replace('TH','H')    # Thurs only two-character abbrev
        if '-' in day_pattern_noTH:
            # Handle patterns like 'M-W', 'T-H', etc.
            match = re.search(r'([A-Z])-([A-Z])', day_pattern_noTH)
            classdays_dict[day_pattern] = classdays_series[slice(match.group(1),match.group(2))].to_list()
        else:
            # Handle single day patterns like 'MWF', 'TTH', etc.
            classdays_dict[day_pattern] = classdays_series[list(day_pattern_noTH)].to_list()


classdays_dict

Parsing for day pattern lookups

{'MWF': ['Mon', 'Wed', 'Fri'],
 'F': ['Fri'],
 'W': ['Wed'],
 'MF': ['Mon', 'Fri'],
 'T': ['Tues'],
 'TTH': ['Tues', 'Thurs'],
 'MW': ['Mon', 'Wed'],
 'MTTH': ['Mon', 'Tues', 'Thurs'],
 'WF': ['Wed', 'Fri'],
 'M': ['Mon'],
 'TH': ['Thurs']}

Test the day range replacements

students_advisors_df.select(pl.col('Pat'), 
                            weekday=pl.col('Pat').replace_strict(classdays_dict))
shape: (2_976, 2)
Pat weekday
str list[str]
"WF" ["Wed", "Fri"]
"WF" ["Wed", "Fri"]
"TTH" ["Tues", "Thurs"]
"F" ["Fri"]
"MW" ["Mon", "Wed"]
"T" ["Tues"]
"TTH" ["Tues", "Thurs"]
"MW" ["Mon", "Wed"]
"W" ["Wed"]
"TTH" ["Tues", "Thurs"]

Replace and explode into rows

  • Other columns get repeated
sa_w_weekdays = (students_advisors_df
           .with_columns(weekday=pl.col('Pat').replace_strict(classdays_dict))
           .explode('weekday')
         )
sa_w_weekdays.select(pl.col('StudentName','Descr','Pat','weekday'))
shape: (4_624, 4)
StudentName Descr Pat weekday
str str str str
"Joseph Garner" "INTRODUCTORY MECHANICS" "WF" "Wed"
"Joseph Garner" "INTRODUCTORY MECHANICS" "WF" "Fri"
"Joseph Garner" "ACADEMIC WRITING" "WF" "Wed"
"Joseph Garner" "ACADEMIC WRITING" "WF" "Fri"
"Joseph Garner" "MATRICES AND VECTORS" "TTH" "Tues"
"Judith Norris" "ADV TOPICS IN DEEP LEARNING" "MW" "Mon"
"Judith Norris" "ADV TOPICS IN DEEP LEARNING" "MW" "Wed"
"Judith Norris" "INTRO TO SIGNALS AND SYSTEMS" "W" "Wed"
"Judith Norris" "MATRICES AND VECTORS" "TTH" "Tues"
"Judith Norris" "MATRICES AND VECTORS" "TTH" "Thurs"

Test time_ranges()

  • Make list of times between “start” and “end”
  • Decided to set closed='left' so don’t include the end time in list

Test time_ranges()

  • Make list of times between “start” and “end”
  • Decided to set closed='left' so don’t include the end time in list
(sa_w_weekdays
 .with_columns(Time=pl.time_ranges("start", "end", interval='5m', closed='left'))
 .select(pl.col('StudentName','weekday','start','end','Time'))
)
shape: (4_624, 5)
StudentName weekday start end Time
str str time time list[time]
"Joseph Garner" "Wed" 13:25:00 14:40:00 [13:25:00, 13:30:00, … 14:35:00]
"Joseph Garner" "Fri" 13:25:00 14:40:00 [13:25:00, 13:30:00, … 14:35:00]
"Joseph Garner" "Wed" 15:05:00 16:20:00 [15:05:00, 15:10:00, … 16:15:00]
"Joseph Garner" "Fri" 15:05:00 16:20:00 [15:05:00, 15:10:00, … 16:15:00]
"Joseph Garner" "Tues" 10:05:00 11:20:00 [10:05:00, 10:10:00, … 11:15:00]
"Judith Norris" "Mon" 15:05:00 16:20:00 [15:05:00, 15:10:00, … 16:15:00]
"Judith Norris" "Wed" 15:05:00 16:20:00 [15:05:00, 15:10:00, … 16:15:00]
"Judith Norris" "Wed" 10:05:00 12:55:00 [10:05:00, 10:10:00, … 12:50:00]
"Judith Norris" "Tues" 08:30:00 09:45:00 [08:30:00, 08:35:00, … 09:40:00]
"Judith Norris" "Thurs" 08:30:00 09:45:00 [08:30:00, 08:35:00, … 09:40:00]

Explode time ranges into rows

  • Make list of times between “start” and “end”
  • Decided to set closed='left' so don’t include the end time in list
(sa_w_weekdays
 .with_columns(Time=pl.time_ranges("start", "end", interval='5m', closed='left'))
 .select(pl.col('StudentName','weekday','start','end','Time'))
 .explode('Time')
)
shape: (75_247, 5)
StudentName weekday start end Time
str str time time time
"Joseph Garner" "Wed" 13:25:00 14:40:00 13:25:00
"Joseph Garner" "Wed" 13:25:00 14:40:00 13:30:00
"Joseph Garner" "Wed" 13:25:00 14:40:00 13:35:00
"Joseph Garner" "Wed" 13:25:00 14:40:00 13:40:00
"Joseph Garner" "Wed" 13:25:00 14:40:00 13:45:00
"Judith Norris" "Thurs" 08:30:00 09:45:00 09:20:00
"Judith Norris" "Thurs" 08:30:00 09:45:00 09:25:00
"Judith Norris" "Thurs" 08:30:00 09:45:00 09:30:00
"Judith Norris" "Thurs" 08:30:00 09:45:00 09:35:00
"Judith Norris" "Thurs" 08:30:00 09:45:00 09:40:00

Time back to string for visualization

class_day_time_df = (
    sa_w_weekdays
     .with_columns(Time=pl.time_ranges("start", "end", interval='5m', closed='left'))
     .explode('Time')
     .select(pl.col('Time').dt.strftime("%H:%M"),
             pl.col('weekday').alias('Day'),
             pl.col('Descr').alias('Class'),
             pl.col('StudentName'),
             pl.col('AdvisorName')
            )
)

class_day_time_df
  • Altair doesn’t support the Polars “time” data type
  • Only keep the columns we’ll need for the visualization
    • Using the .alias() form in .select() since prefer this column order and can’t put “positional” arguments after “keyword” ones

Time back to string for visualization

shape: (75_247, 5)
Time Day Class StudentName AdvisorName
str str str str str
"13:25" "Wed" "INTRODUCTORY MECHANICS" "Joseph Garner" "Henry Schroeder"
"13:30" "Wed" "INTRODUCTORY MECHANICS" "Joseph Garner" "Henry Schroeder"
"13:35" "Wed" "INTRODUCTORY MECHANICS" "Joseph Garner" "Henry Schroeder"
"13:40" "Wed" "INTRODUCTORY MECHANICS" "Joseph Garner" "Henry Schroeder"
"13:45" "Wed" "INTRODUCTORY MECHANICS" "Joseph Garner" "Henry Schroeder"
"09:20" "Thurs" "MATRICES AND VECTORS" "Judith Norris" "Angie Moon"
"09:25" "Thurs" "MATRICES AND VECTORS" "Judith Norris" "Angie Moon"
"09:30" "Thurs" "MATRICES AND VECTORS" "Judith Norris" "Angie Moon"
"09:35" "Thurs" "MATRICES AND VECTORS" "Judith Norris" "Angie Moon"
"09:40" "Thurs" "MATRICES AND VECTORS" "Judith Norris" "Angie Moon"

Lists for visualization

  • Need an ordered list of times for the vis row ordering
  • List of weekday names sets the vis column order
  • List of advisors we will facet by
times_list = (class_day_time_df
 .get_column('Time')
 .unique()
 .sort()
 .to_list()
)
# Look at a few entries...
print(times_list[:5])

# Sets the visualization column order
if days_version.lower()=="abbreviation":
    days_list = ['Mon', 'Tues', 'Wed', 'Thurs', 'Fri']
else:
    days_list = ['Monday','Tuesday','Wednesday','Thursday','Friday']

advisors_list = (class_day_time_df
 .get_column('AdvisorName')
 .unique()
 .sort()
 .to_list()
)
print('There are', len(advisors_list), 'advisors')
[None, '08:30', '08:35', '08:40', '08:45']
There are 27 advisors

Visualization

Visualization

  • Finally!  :-)
  • Grouping and aggregation done in Polars so:
    • not too many data rows in Altair
    • faster
      • Altair has built-in aggregation
      • It’s not very fast
      • VegaFusion helps with speed + number of rows issue

Start with a single advisor

advisor = advisors_list[1]
print(advisor)
Beth Mcconnell

DataFrame for a single advisor

single_advisor_df = (
  class_day_time_df
    .filter(pl.col('AdvisorName') == advisor)
    .group_by(pl.col('AdvisorName','Day','Time'))
    .agg(Count=pl.col('Class').count(),
         Students=pl.col('StudentName').str.join('; '))
)

single_advisor_df.select(
    pl.all().exclude('AdvisorName')
    )
shape: (542, 4)
Day Time Count Students
str str u32 str
"Thurs" "11:20" 1 "Christy Curry"
"Wed" "09:30" 7 "Lori Bennett; Wendy Barr; Ralp…
"Wed" "11:15" 2 "Elaine Drake; Joanne Torres"
"Fri" "10:15" 2 "Elaine Drake; Joanne Torres"
"Mon" "18:45" 1 "Taylor Freeman"
"Wed" "10:55" 9 "Wendy Barr; Ralph Hernandez; C…
"Fri" "14:30" 4 "Ralph Hernandez; Francisco Jua…
"Tues" "11:35" 1 "Taylor Freeman"
"Tues" "09:25" 3 "Taylor Freeman; Elaine Drake; …
"Wed" "18:35" 4 "Christy Curry; Francisco Juare…

Using Altair for a single advisor

single_advisor_df = (
  class_day_time_df
    .filter(pl.col('AdvisorName') == advisor)
    .group_by(pl.col('AdvisorName','Day','Time'))
    .agg(Count=pl.col('Class').count(),
         Students=pl.col('StudentName').str.join('; '))
)

single_advisor_df.plot.rect(
    x='Day',
    y='Time',
    color='Count',
    tooltip=['Day','Time','Count','Students']
)

Using Altair for a single advisor

single_advisor_df = (
  class_day_time_df
    .filter(pl.col('AdvisorName') == advisor)
    .group_by(pl.col('AdvisorName','Day','Time'))
    .agg(Count=pl.col('Class').count(),
         Students=pl.col('StudentName').str.join('; '))
)

single_advisor_df.plot.rect(
    x='Day:O',
    y='Time:O',
    color='Count:Q',
    tooltip=['Day','Time','Count','Students']
).properties(
        width=180,
        height=500
)

Using Altair for a single advisor

(class_day_time_df
    .filter(pl.col('AdvisorName') == advisor)
    .group_by(pl.col('AdvisorName','Day','Time'))
    .agg(Count=pl.col('Class').count(),
         Students=pl.col('StudentName').str.join('; '))
).plot.rect(
    x='Day:O',
    y='Time:O',
    color='Count:Q',
    tooltip=['Day','Time','Count','Students']
).properties(
        width=180,
        height=500
)

Using Altair for a single advisor

(class_day_time_df
    .filter(pl.col('AdvisorName') == advisor)
    .group_by(pl.col('AdvisorName','Day','Time'))
    .agg(Count=pl.col('Class').count(),
         Students=pl.col('StudentName').str.join('; '))
    .plot.rect(
        x='Day:O',
        y='Time:O',
        color='Count:Q',
        tooltip=['Day','Time','Count','Students']
    ).properties(
            width=180,
            height=500
    )
)

Using Altair for a single advisor

(class_day_time_df
    .filter(pl.col('AdvisorName') == advisor)
    .group_by(pl.col('AdvisorName','Day','Time'))
    .agg(Count=pl.col('Class').count(),
         Students=pl.col('StudentName').str.join('; '))
    .plot.rect(
        x=alt.X('Day:O'),
        y=alt.Y('Time:O'),
        color=alt.Color('Count:Q'),
        tooltip=['Day','Time','Count','Students']
    ).properties(
            width=180,
            height=500
    )
)

Using Altair for a single advisor

(class_day_time_df
    .filter(pl.col('AdvisorName') == advisor)
    .group_by(pl.col('AdvisorName','Day','Time'))
    .agg(Count=pl.col('Class').count(),
         Students=pl.col('StudentName').str.join('; '))
    .plot.rect(
        x=alt.X('Day:O', 
            axis=alt.Axis(labelAngle=0),
            sort=days_list, 
            scale=alt.Scale(domain=days_list), 
            title=''),
        y=alt.Y('Time:O'),
        color=alt.Color('Count:Q'),
        tooltip=['Day','Time','Count','Students']
    ).properties(
            width=180,
            height=500
    )
)

Using Altair for a single advisor

(class_day_time_df
    .filter(pl.col('AdvisorName') == advisor)
    .group_by(pl.col('AdvisorName','Day','Time'))
    .agg(Count=pl.col('Class').count(),
         Students=pl.col('StudentName').str.join('; '))
    .plot.rect(
        x=alt.X('Day:O', 
            axis=alt.Axis(labelAngle=0),
            sort=days_list, 
            scale=alt.Scale(domain=days_list), 
            title=''),
        y=alt.Y('Time:O', 
            title='time of day', 
            scale=alt.Scale(domain=times_list),
            axis=alt.Axis(labelOverlap=True)),
        color=alt.Color('Count:Q'),
        tooltip=['Day','Time','Count','Students']
    ).properties(
            width=180,
            height=500
    )
)

Using Altair for a single advisor

(class_day_time_df
    .filter(pl.col('AdvisorName') == advisor)
    .group_by(pl.col('AdvisorName','Day','Time'))
    .agg(Count=pl.col('Class').count(),
         Students=pl.col('StudentName').str.join('; '))
    .plot.rect(
        x=alt.X('Day:O', 
            axis=alt.Axis(labelAngle=0),
            sort=days_list, 
            scale=alt.Scale(domain=days_list), 
            title=''),
        y=alt.Y('Time:O', 
            title='time of day', 
            scale=alt.Scale(domain=times_list),
            axis=alt.Axis(labelOverlap=True)),
        color=alt.Color('Count:Q', 
            scale=alt.Scale(scheme='blues'), 
            legend=alt.Legend(title='# students')),
        tooltip=['Day','Time','Count','Students']
    ).properties(
            width=180,
            height=500
    )
)

Using Altair for a single advisor

(class_day_time_df
    .filter(pl.col('AdvisorName') == advisor)
    .group_by(pl.col('AdvisorName','Day','Time'))
    .agg(Count=pl.col('Class').count(),
         Students=pl.col('StudentName').str.join('; '))
    .plot.rect(
        x=alt.X('Day:O', 
            axis = alt.Axis(labelAngle=0), # default vertical labels
            sort=days_list, 
            scale=alt.Scale(domain=days_list), title=''),
        y=alt.Y('Time:O', 
            title='time of day', 
            scale=alt.Scale(domain=times_list),
            axis=alt.Axis(labelOverlap=True)), # confusing naming
        color=alt.Color('Count:Q', 
            scale=alt.Scale(scheme='blues'), 
            legend=alt.Legend(title='# students')),
        tooltip=['Day','Time','Count','Students']
    ).properties(
        width=180,
        height=500,
        title=advisor
    )
)

Faceted by advisor

  • Using VegaFusion for this since over 14,000 rows without filtering
alt.data_transformers.enable("vegafusion")
DataTransformerRegistry.enable('vegafusion')
  • Filtering out “No Advisor” (gets the most counts) for better colormap range

Faceted by advisor

(class_day_time_df
    .filter(pl.col('AdvisorName')!='No Advisor')
    .group_by(pl.col('AdvisorName','Day','Time'))
    .agg(Count=pl.col('Class').count(),
         Students=pl.col('StudentName').str.join('; '))
    .plot.rect(
        x=alt.X('Day:O', 
            axis = alt.Axis(labelAngle=0), 
            sort=days_list, 
            scale=alt.Scale(domain=tuple(days_list)), title=''),
        y=alt.Y('Time:O', 
            title='time of day', 
            scale=alt.Scale(domain=times_list),
            axis=alt.Axis(labelOverlap='parity')),
        color=alt.Color('Count:Q', 
            scale=alt.Scale(scheme='blues'), 
            legend=alt.Legend(title='# students')),
        tooltip=['Day','Time','Count','Students']
    ).properties(
        width=150, height=300, title=advisor
    ).facet(
        facet="AdvisorName",
        columns=5
    )
)

Faceted by advisor

Thanks!